sketch out improved performance by refactoring codec pipeline logic by d-v-b · Pull Request #3719 · zarr-developers/zarr-python

d-v-b · 2026-02-25T22:15:39Z

This builds on #3715 and achieves further performance improvements by refactoring the core logic of codec encoding and decoding. The design document behind these changes is here.

I do not think this is merge-worthy, as it's far too big. But I'm going to post the performance gains and start figuring out how to break this into pieces.

A big feature this adds is the ability to write individual subchunks to uncompressed shards on storage backends that support range writes (local and memory).
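To illustrate why range writes matter here, the following is a minimal sketch, not the code in this PR. It assumes a simplified, hypothetical shard layout in which fixed-size uncompressed subchunks are stored back to back, so subchunk `i` starts at byte `i * CHUNK_NBYTES`. It ignores the shard index that the real sharding format carries, and the names `write_subchunk`, `read_subchunk`, `CHUNK_NBYTES`, and `CHUNKS_PER_SHARD` are made up for the example.

```python
import os
import tempfile

# Hypothetical layout (an assumption for illustration): every subchunk is
# uncompressed and has the same size, so its byte offset is predictable.
CHUNK_NBYTES = 32
CHUNKS_PER_SHARD = 8


def write_subchunk(path: str, index: int, payload: bytes) -> None:
    """Overwrite one subchunk in place using a byte-range write."""
    if len(payload) != CHUNK_NBYTES:
        raise ValueError("payload must be exactly one subchunk long")
    if not 0 <= index < CHUNKS_PER_SHARD:
        raise IndexError(index)
    # "r+b" opens for update without truncating the existing shard.
    with open(path, "r+b") as f:
        f.seek(index * CHUNK_NBYTES)
        f.write(payload)


def read_subchunk(path: str, index: int) -> bytes:
    """Read one subchunk back with a byte-range read."""
    with open(path, "rb") as f:
        f.seek(index * CHUNK_NBYTES)
        return f.read(CHUNK_NBYTES)


# Create a zero-filled shard, then update only subchunk 3.
shard_path = os.path.join(tempfile.mkdtemp(), "shard.bin")
with open(shard_path, "wb") as f:
    f.write(bytes(CHUNKS_PER_SHARD * CHUNK_NBYTES))

write_subchunk(shard_path, 3, b"\x07" * CHUNK_NBYTES)
```

Without range writes, updating one subchunk would mean reading the whole shard, patching it in memory, and writing the whole shard back. With compressed subchunks the encoded sizes vary, so an in-place overwrite like this is not generally possible, which is presumably why the feature is limited to uncompressed shards.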

Benchmark comparison: perf/smarter-codecs vs main

Test Name	perf/smarter-codecs (ms)	main (ms)	Speedup
test_write_array[memory-chunks=100,shards=1M-None]	46.85	1018.60	21.74×
test_write_array[local-chunks=100,shards=1M-None]	68.45	1006.27	14.70×
test_sharded_morton_indexing[(32,32,32)]	22.45	247.49	11.02×
test_slice_indexing[None-(0,0,0)]	0.03	0.28	10.05×
test_sharded_morton_indexing_large[(33,33,33)]	254.85	2521.32	9.89×
test_slice_indexing[(50,50,50)-full_slice]	9.12	89.80	9.85×
test_sharded_morton_indexing_large[(32,32,32)]	226.80	2153.47	9.50×
test_sharded_morton_indexing_large[(30,30,30)]	181.86	1725.29	9.49×
test_slice_indexing[(50,50,50)-(0,0,0)]	0.06	0.60	9.45×
test_write_array[memory-chunks=100,shards=1M-gzip]	211.85	1978.29	9.34×
test_slice_indexing[None-(slice(None,10,None))*3]	0.03	0.28	9.28×
test_write_array[local-chunks=100,shards=1M-gzip]	217.00	1965.37	9.06×
test_slice_indexing[(50,50,50)-strided_4]	8.96	80.44	8.98×
test_slice_indexing[(50,50,50)-strided_4_offset]	4.83	43.21	8.95×
test_sharded_morton_indexing[(16,16,16)]	2.78	23.26	8.37×
test_slice_indexing[(50,50,50)-(slice(None,10,None))*3]	0.07	0.60	8.36×
test_slice_indexing[None-full_slice]	11.07	85.12	7.69×
test_read_array[memory-chunks=100,shards=1M-gzip]	181.44	1372.13	7.56×
test_read_array[memory-chunks=100,shards=1M-None]	80.73	609.53	7.55×
test_read_array[memory-chunks=1K,no_shards-None]	7.21	53.52	7.43×
test_read_array[local-chunks=100,shards=1M-None]	83.79	612.18	7.31×
test_slice_indexing[None-strided_4]	12.19	84.98	6.97×
test_read_array[memory-chunks=1K,no_shards-gzip]	16.79	115.69	6.89×
test_write_array[memory-chunks=1K,no_shards-gzip]	32.54	219.93	6.76×
test_read_array[local-chunks=100,shards=1M-gzip]	190.20	1277.12	6.71×
test_read_array[local-chunks=1K,no_shards-None]	24.23	142.33	5.87×
test_slice_indexing[None-mixed_slice]	0.12	0.71	5.84×
test_read_array[local-chunks=1K,no_shards-gzip]	37.83	216.45	5.72×
test_slice_indexing[None-strided_4_offset]	5.77	32.55	5.64×
test_write_array[memory-chunks=1K,no_shards-None]	19.16	106.12	5.54×
test_slice_indexing[(50,50,50)-mixed_slice]	0.23	1.21	5.29×
test_slice_indexing[(50,50,50)-strided_4-get_latency]	15.82	82.14	5.19×
test_write_array[memory-chunks=1K,shards=1K-gzip]	100.60	451.93	4.49×
test_read_array[local-chunks=1K,shards=1K-None]	66.94	298.83	4.46×
test_write_array[memory-chunks=1K,shards=1K-None]	71.06	315.12	4.43×
test_read_array[memory-chunks=1K,shards=1K-None]	43.90	192.21	4.38×
test_sharded_morton_single_chunk[(32,32,32)]	0.18	0.74	4.12×
test_read_array[memory-chunks=1K,shards=1K-gzip]	68.96	283.15	4.11×
test_read_array[local-chunks=1K,shards=1K-gzip]	90.06	368.79	4.09×
test_sharded_morton_single_chunk[(33,33,33)]	0.19	0.76	4.03×
test_slice_indexing[(50,50,50)-full_slice-get_latency]	20.53	79.62	3.88×
test_sharded_morton_single_chunk[(30,30,30)]	0.20	0.72	3.69×
test_slice_indexing[(50,50,50)-strided_4_offset-get_latency]	17.73	50.27	2.84×
test_slice_indexing[None-strided_4_offset-get_latency]	19.29	48.98	2.54×
test_slice_indexing[None-strided_4-get_latency]	43.55	102.36	2.35×
test_slice_indexing[None-full_slice-get_latency]	46.77	101.82	2.18×
test_write_array[local-chunks=1K,shards=1K-gzip]	367.61	733.76	2.00×
test_slice_indexing[(50,50,50)-(0,0,0)-get_latency]	0.49	0.87	1.79×
test_slice_indexing[(50,50,50)-(slice(None,10,None))*3-get_latency]	0.50	0.88	1.77×
test_slice_indexing[None-mixed_slice-get_latency]	0.58	1.00	1.74×
test_slice_indexing[(50,50,50)-mixed_slice-get_latency]	1.12	1.92	1.72×
test_write_array[local-chunks=1K,shards=1K-None]	390.76	619.05	1.58×
test_write_array[local-chunks=1K,no_shards-None]	225.86	340.42	1.51×
test_write_array[local-chunks=1K,no_shards-gzip]	284.20	400.62	1.41×
test_slice_indexing[None-(slice(None,10,None))*3-get_latency]	0.31	0.43	1.39×
test_slice_indexing[None-(0,0,0)-get_latency]	0.33	0.43	1.28×
test_morton_order_iter[(30,30,30)]	104.17	123.84	1.19×
test_morton_order_iter[(16,16,16)]	2.56	3.01	1.18×
test_morton_order_iter[(10,10,10)]	7.44	8.61	1.16×
test_morton_order_iter[(8,8,8)]	0.31	0.36	1.15×
test_morton_order_iter[(33,33,33)]	644.86	740.97	1.15×
test_morton_order_iter[(20,20,20)]	78.57	88.98	1.13×
test_sharded_morton_write_single_chunk[(33,33,33)]	668.32	754.39	1.13×
test_morton_order_iter[(32,32,32)]	22.80	24.85	1.09×
test_sharded_morton_write_single_chunk[(30,30,30)]	130.08	129.83	1.00×
test_sharded_morton_write_single_chunk[(32,32,32)]	48.33	32.43	0.67×
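
The Speedup column appears to be the `main` time divided by the `perf/smarter-codecs` time, so values below 1× indicate a regression. That reading is an assumption, but it is consistent with the rows above. A small sketch reproducing the first and last rows (the `speedup` helper is hypothetical):

```python
def speedup(branch_ms: float, main_ms: float) -> float:
    """Ratio of baseline time to branch time; > 1 means the branch is faster."""
    return main_ms / branch_ms


# (branch_ms, main_ms) pairs copied from the table above.
rows = {
    "test_write_array[memory-chunks=100,shards=1M-None]": (46.85, 1018.60),
    "test_sharded_morton_write_single_chunk[(32,32,32)]": (48.33, 32.43),
}

for name, (branch_ms, main_ms) in rows.items():
    print(f"{name}: {speedup(branch_ms, main_ms):.2f}x")
```

For sub-millisecond rows the ratio of the rounded timings does not match the listed speedup (for example 0.28 / 0.03), presumably because the speedups were computed from unrounded measurements.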

d-v-b added 30 commits February 18, 2026 21:48

sketch out sync codecs + threadpool

f427898

Merge branch 'main' into perf/faster-codecs

dbdc3d4

fix perf regressions

65d1230

Merge branch 'perf/faster-codecs' of https://github.com/d-v-b/zarr-python into perf/faster-codecs

e24fe7e

add partial encode / decode

f979eaa

add sync hotpath

a934899

add comments and documentation

b53ac3e

refactor sharding to allow sync

73ac845

fix array spec propagation

aeecda8

fix countingdict tests

69172fb

update design doc

28d0def

dynamic pool allocation

f8e39e6

default to 1 itemsize for data types that don't declare it

b388911

Merge branch 'main' into perf/faster-codecs

7e29ef3

Merge branch 'main' into perf/faster-codecs

00dde0b

remove extra codec pipeline

9d77ca5

remove garbage

88a4875

lint

284e5e2

use protocols for new sync behavior

b1b876a

remove batch size parameter; add changelog entry

6996284

prune dead code, make protocols useful

204dda1

restore batch size but it's only there for warnings

e9db616

fix type hints, prevent thread pool leakage, make codec pipeline introspection more efficient

01e1f73

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into perf/faster-codecs

fbde3af

restore old comments / docstrings

11534b0

simplify threadpool management

b40d53a

use isinstance instead of explicit list of codec names

83c1dc1

consolidate thread pool configuration

e8a0cc6

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into perf/faster-codecs

9a1d5eb

Merge remote-tracking branch 'origin/main' into perf/smarter-codecs

9071954

d-v-b added 2 commits February 25, 2026 16:58

execute performance improvement plan

3297e0d

Merge branch 'main' into perf/smarter-codecs

0766289
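
Several commit titles above mention sync codecs, a thread pool, protocols, and `isinstance` checks. As a rough illustration of that general pattern, and not the code in this PR, the sketch below dispatches on a runtime-checkable protocol: codecs exposing a synchronous method are mapped over a thread pool, and others fall back to their async method. All names here (`SupportsSyncEncode`, `encode_sync`, `ZlibCodec`, `AsyncOnlyCodec`, `encode_chunks`) are hypothetical.

```python
import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSyncEncode(Protocol):
    """Hypothetical protocol: a codec that can encode without an event loop."""

    def encode_sync(self, data: bytes) -> bytes: ...


class ZlibCodec:
    # Satisfies the protocol structurally; no inheritance needed.
    def encode_sync(self, data: bytes) -> bytes:
        return zlib.compress(data)


class AsyncOnlyCodec:
    # Has no encode_sync, so it takes the async fallback path.
    async def encode(self, data: bytes) -> bytes:
        await asyncio.sleep(0)
        return data[::-1]


def encode_chunks(codec, chunks: list[bytes], pool: ThreadPoolExecutor) -> list[bytes]:
    """Encode chunks on the thread pool when the codec supports sync encoding."""
    if isinstance(codec, SupportsSyncEncode):
        # pool.map preserves input order.
        return list(pool.map(codec.encode_sync, chunks))

    async def _run() -> list[bytes]:
        return [await codec.encode(c) for c in chunks]

    return asyncio.run(_run())


chunks = [b"a" * 100, b"b" * 100]
with ThreadPoolExecutor(max_workers=2) as pool:
    encoded_sync = encode_chunks(ZlibCodec(), chunks, pool)
    encoded_async = encode_chunks(AsyncOnlyCodec(), chunks, pool)
```

Checking for a protocol rather than a hard-coded list of codec names lets third-party codecs opt in to the synchronous path simply by providing the method.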

d-v-b mentioned this pull request Feb 25, 2026

chunk encoding performance improvements #3720

Open

8 tasks

d-v-b wants to merge 32 commits into zarr-developers:main from d-v-b:perf/smarter-codecs